Recommender systems use data on past user preferences to predict
possible future likes and interests. A key challenge is that while the
most useful individual recommendations are to be found among diverse
niche objects, the most reliably accurate results are obtained by
methods that recommend objects based on user or object similarity. In
this paper we introduce a new algorithm specifically to address the
challenge of diversity and show how it can be used to resolve this
apparent dilemma when combined in an elegant hybrid with an
accuracy-focused algorithm. By tuning the hybrid appropriately we are
able to obtain, without relying on any semantic or context-specific
information, simultaneous gains in both accuracy and diversity of
recommendations.

information filtering | recommender systems | hybrid algorithms 

Getting what you want, as the saying goes, is easy: the
hard part is working out what it is that you want in the
first place [1]. Whereas information filtering tools like search
engines typically require the user to specify in advance what
they are looking for [2] [3] [4] [5], this challenge of identifying
user needs is the domain of recommender systems [5] [6] [7] [8],
which attempt to anticipate future likes and interests by mining
data on past user activities.

Many diverse recommendation techniques have been developed,
including collaborative filtering [6] [9], content-based
analysis [10], spectral analysis [11] [12], latent semantic models
and Dirichlet allocation [13] [14], and iterative self-consistent
refinement [15] [16] [17]. What most have in common is that
they are based on similarity, either of users or objects or both:
for example, e-commerce sites such as Amazon.com use the
overlap between customers' past purchases and browsing activity
to recommend products [18] [19], while the TiVo digital
video system recommends TV shows and movies on the basis
of correlations in users' viewing patterns and ratings [20].
The risk of such an approach is that, with recommendations 
based on overlap rather than difference, more and more users 
will be exposed to a narrowing band of popular objects, while 
niche items that might be very relevant will be overlooked. 

The focus on similarity is compounded by the metrics used 
to assess recommendation performance. A typical method of 
comparison is to consider an algorithm's accuracy in repro- 
ducing known user opinions that have been removed from a 
test data set. An accurate recommendation, however, is not 
necessarily a useful one: real value is found in the ability 
to suggest objects users would not readily discover for them- 
selves, that is, in the novelty and diversity of recommenda- 
tion [21]. Despite this, most studies of recommender systems 
focus overwhelmingly on accuracy as the only important fac- 
tor (for example, the Netflix Prize [22] challenged researchers 
to increase accuracy without any reference to novelty or per- 
sonalization of results). Where diversification is addressed, it 
is typically as an adjunct to the main recommendation pro- 
cess, based on restrictive features such as semantic or other 
context-specific information [23] [24].

The clear concern is that an algorithm that focuses too
strongly on diversity rather than similarity is putting accuracy
at risk. Our main focus in this paper is to show that this
apparent dilemma can in fact be resolved by an appropriate 
combination of accuracy- and diversity-focused methods. We 
begin by introducing a "heat-spreading" algorithm designed 
specifically to address the challenge of diversity, with high 
success both at seeking out novel items and at enhancing the 
personalization of individual user recommendations. We show 
how this algorithm can be coupled in a highly efficient hybrid 
with a diffusion-based recommendation method recently in- 
troduced by our group [25]. Using three different datasets 
from three distinct communities, we employ a combination of 
accuracy- and diversity-related metrics to perform a detailed 
study of recommendation performance and a comparison to 
well-known methods. We show that not only does the hy- 
brid algorithm outperform other methods but that, without 
relying on any semantic or context-specific information, it can 
be tuned to obtain significant and simultaneous gains in both 
accuracy and diversity of recommendations. 



Methods 

Recommendation procedure. Since explicit ratings are not always
available [26], the algorithms studied in this paper are
selected to work with very simple input data: $u$ users, $o$ objects,
and a set of links between the two corresponding to the
objects collected by particular users (more explicit preference
indicators can be easily mapped to this "unary" form, albeit
losing information in the process, whereas the converse is not
so). These links can be represented by an $o \times u$ adjacency
matrix $A$ where $a_{\alpha i} = 1$ if object $\alpha$ is collected by user $i$
and $a_{\alpha i} = 0$ otherwise (throughout this paper we use Greek
and Latin letters respectively for object- and user-related
indices). Alternatively we can visualize the data as a bipartite
user-object network with $u + o$ nodes, where the degrees of
object and user nodes, $k_\alpha$ and $k_i$, represent respectively the
number of users who have collected object $\alpha$ and the number
of objects collected by user $i$.
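
As a minimal sketch of this unary representation, the following builds the adjacency matrix and both sets of degrees for a toy dataset. The data and the names `collections`, `object_degree` and `user_degree` are hypothetical and chosen only for illustration.

```python
# Toy data (hypothetical): each user maps to the set of objects collected.
collections = {0: {0, 1}, 1: {0, 2}, 2: {1, 2, 3}}
n_users, n_objects = 3, 4

# A[alpha][i] = 1 if object alpha is collected by user i, else 0 (o x u matrix).
A = [[1 if alpha in collections[i] else 0 for i in range(n_users)]
     for alpha in range(n_objects)]

# Object degree k_alpha: number of users who collected object alpha.
object_degree = [sum(row) for row in A]

# User degree k_i: number of objects collected by user i.
user_degree = [sum(A[alpha][i] for alpha in range(n_objects))
               for i in range(n_users)]
```

Both degree lists sum to the total number of links, which gives a quick consistency check on any real dataset loaded this way.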

Recommendation scores are calculated for each user and 
each of their uncollected objects, enabling the construction 
of a sorted recommendation list with the most-recommended 
items at the top. Different algorithms generate different ob- 
ject scores and thus different rankings. 

Algorithms. The heat spreading (HeatS) algorithm introduced
here employs a process analogous to heat diffusion across the
user-object network. This can be related to earlier work using
a "heat conduction" algorithm to generate recommendations
[27] [28], but with some key differences. The earlier algorithm
operates on an object-object network derived from
an explicit ratings structure, which washes out information
about novelty or popularity of objects and consequently limits
the algorithm to considering questions of accuracy and not
diversity. The algorithm also requires multiple iterations to
converge to a steady state. By contrast HeatS requires no
more than unary data, and generates effective recommendations
in a single pass.

HeatS works by assigning objects an initial level of "resource"
denoted by the vector $f$ (where $f_\beta$ is the resource
possessed by object $\beta$), and then redistributing it via the
transformation $\tilde{f} = W^H f$, where

$$W^H_{\alpha\beta} = \frac{1}{k_\alpha} \sum_{j=1}^{u} \frac{a_{\alpha j} a_{\beta j}}{k_j} \qquad (1)$$

is a row-normalized $o \times o$ matrix representing a discrete analogy
of a heat diffusion process. Recommendations for a given
user $i$ are obtained by setting the initial resource vector $f^i$
in accordance with the objects the user has already collected,
that is, by setting $f^i_\beta = a_{\beta i}$. The resulting recommendation
list of uncollected objects is then sorted according to $\tilde{f}^i_\alpha$ in
descending order.

HeatS is a variant on an earlier probabilistic spreading
(ProbS) algorithm introduced by our group [25], which redistributes
resource in a manner akin to a random walk process.
Whereas HeatS employs a row-normalized transition matrix,
that of ProbS is column-normalized,

$$W^P_{\alpha\beta} = \frac{1}{k_\beta} \sum_{j=1}^{u} \frac{a_{\alpha j} a_{\beta j}}{k_j} \qquad (2)$$

with the resource redistribution and resulting object scores
then being given by $\tilde{f} = W^P f = (W^H)^T f$.

A visual representation of the resource spreading processes 
of ProbS and HeatS is given in Fig. 1: in ProbS (a-c) the
initial resource placed on objects is first evenly distributed 
among neighboring users, and then evenly redistributed back 
to those users' neighboring objects. By contrast HeatS (d-f) 
redistributes resource via an averaging procedure, with users 
receiving a level of resource equal to the mean amount pos- 
sessed by their neighboring objects, and objects then receiv- 
ing back the mean of their neighboring users' resource levels. 
(Note that in ProbS total resource levels remain constant, 
whereas in HeatS this is not so.) Due to the sparsity of real 
datasets, these "physical" descriptions of the algorithms turn 
out to be more computationally efficient in practice than
constructing and using the transition matrices $W^P$ and $W^H$.
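
The two "physical" spreading procedures described above can be sketched directly as two passes over the bipartite network, without building a transition matrix. This is an illustrative sketch on hypothetical toy data, assuming every object and user has at least one link; the function names are ours.

```python
# Toy bipartite network (hypothetical data): A[alpha][i] = 1 if object alpha
# is collected by user i. Four objects, three users.
A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [0, 0, 1]]
n_objects, n_users = len(A), len(A[0])
k_obj = [sum(row) for row in A]
k_user = [sum(A[a][i] for a in range(n_objects)) for i in range(n_users)]


def probs_spread(f):
    """ProbS: objects split resource evenly among users, users split it back."""
    on_users = [sum(A[b][j] * f[b] / k_obj[b] for b in range(n_objects))
                for j in range(n_users)]
    return [sum(A[a][j] * on_users[j] / k_user[j] for j in range(n_users))
            for a in range(n_objects)]


def heats_spread(f):
    """HeatS: users take the mean over their objects, objects the mean back."""
    on_users = [sum(A[b][j] * f[b] for b in range(n_objects)) / k_user[j]
                for j in range(n_users)]
    return [sum(A[a][j] * on_users[j] for j in range(n_users)) / k_obj[a]
            for a in range(n_objects)]


target = 0
f = [A[b][target] for b in range(n_objects)]   # initial resource: [1, 1, 0, 0]
probs_scores = probs_spread(f)
heats_scores = heats_spread(f)
```

On this toy network ProbS conserves the total resource while HeatS does not, and the single-link object 3 receives a higher score under HeatS than under ProbS, in line with the qualitative behavior described in the text.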

To provide a point of comparison we also employ two 
methods well-known in the recommender systems literature. 
Global ranking (GRank) recommends objects according to 
their overall popularity, sorting them by their degree $k_\alpha$ in
descending order. While computationally cheap, GRank is not
personalized (apart from the exclusion of different already- 
collected objects) and in most cases it performs poorly. 

A much more effective method is user similarity (USim), a
well known and widely used technique that recommends items
frequently collected by a given user's "taste mates" [8]. The
taste overlap between users $i$ and $j$ is measured by the cosine
similarity,

$$s_{ij} = \frac{\sum_{\alpha=1}^{o} a_{\alpha i} a_{\alpha j}}{\sqrt{k_i k_j}} \qquad (3)$$

and if user $i$ has not yet collected object $\alpha$, its recommendation
score is given by

$$v_{\alpha i} = \frac{\sum_{j=1}^{u} s_{ij} a_{\alpha j}}{\sum_{j=1}^{u} s_{ij}} \qquad (4)$$

with the final recommendation list for user $i$ being sorted
according to $v_{\alpha i}$ in descending order.
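
A small sketch of USim on hypothetical toy data follows. The similarity is the cosine between binary collection vectors; the score is assumed here to be a similarity-weighted vote normalized by the sum of similarities. That normalization is our assumption, and since it is constant for a given user it does not affect the resulting ranking.

```python
from math import sqrt

# Toy bipartite network (hypothetical data): A[alpha][i] = 1 if object alpha
# is collected by user i.
A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [0, 0, 1]]
n_objects, n_users = len(A), len(A[0])
k_user = [sum(A[a][i] for a in range(n_objects)) for i in range(n_users)]


def cosine_similarity(i, j):
    """Number of objects shared by users i and j, divided by sqrt(k_i * k_j)."""
    shared = sum(A[a][i] * A[a][j] for a in range(n_objects))
    return shared / sqrt(k_user[i] * k_user[j])


def usim_scores(i):
    """Similarity-weighted vote for each object not yet collected by user i."""
    s = [cosine_similarity(i, j) for j in range(n_users)]
    norm = sum(s)  # assumed normalization; constant for a given user
    return {a: sum(s[j] * A[a][j] for j in range(n_users)) / norm
            for a in range(n_objects) if A[a][i] == 0}


scores = usim_scores(0)
ranking = sorted(scores, key=scores.get, reverse=True)
```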



Hybrid methods. A basic but very general means of creating
hybrid algorithms is to use weighted linear aggregation [23]:
if methods X and Y report scores of $x_\alpha$ and $y_\alpha$ respectively,
then a hybrid score for object $\alpha$ can be given by

$$z_\alpha = (1 - \lambda) \left[ \frac{x_\alpha}{\max_\beta x_\beta} \right] + \lambda \left[ \frac{y_\alpha}{\max_\beta y_\beta} \right] \qquad (5)$$

where the normalizations address the fact that different methods
may produce scores on very different scales. By varying
the parameter $\lambda \in [0, 1]$, we can tune the hybrid X+Y to favor
the characteristics of one method or the other.
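
A minimal sketch of weighted linear aggregation is given below. It assumes each method's scores are scaled by that method's maximum score before mixing, and that both maxima are positive; the function name `linear_hybrid` and the sample scores are hypothetical.

```python
def linear_hybrid(x, y, lam):
    """Combine two score lists after scaling each by its own maximum.

    lam = 0 returns the (scaled) scores of method X, lam = 1 those of Y.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_max, y_max = max(x), max(y)
    return [(1 - lam) * xa / x_max + lam * ya / y_max for xa, ya in zip(x, y)]


# Hypothetical scores from two methods on very different scales.
x = [0.2, 0.1, 0.4]      # method X
y = [30.0, 60.0, 15.0]   # method Y
z = linear_hybrid(x, y, 0.5)
```

Note that both underlying score lists must be computed in full before they can be combined, which is the source of the additional computational cost of this approach.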

Though easy to implement, this approach has the disadvantage
of requiring two independent recommendation calculations,
thus increasing computational cost. HeatS and ProbS,
however, are already fundamentally linked, with their
recommendation processes being determined by different normalizations
of the same underlying matrix (in fact, their transition
matrices are the transpose of each other). A much more elegant
hybrid can thus be achieved by incorporating the hybridization
parameter $\lambda$ into the transition matrix normalization:

$$W^{H+P}_{\alpha\beta} = \frac{1}{k_\alpha^{1-\lambda} k_\beta^{\lambda}} \sum_{j=1}^{u} \frac{a_{\alpha j} a_{\beta j}}{k_j} \qquad (6)$$

where $\lambda = 0$ gives us the pure HeatS algorithm, and $\lambda = 1$
gives us pure ProbS (other hybrid forms are possible but give
inferior performance: Fig. S1 of the supporting information (SI)
provides a comparison of the different alternatives). In contrast
to Eq. 5 this HeatS+ProbS hybrid has a computational
complexity of order no greater than ProbS or HeatS alone.
Note that while in the present work $\lambda$ takes a universal value,
there is no reason in principle why we cannot use different
values for each individual target user.
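
The following sketch builds a hybrid transition matrix on hypothetical toy data, assuming the normalization takes the form $k_\alpha^{1-\lambda} k_\beta^{\lambda}$ so that $\lambda = 0$ is row-normalized (HeatS) and $\lambda = 1$ is column-normalized (ProbS). A dense matrix is used only for clarity; it is not meant as an efficient implementation.

```python
# Toy bipartite network (hypothetical data): A[alpha][i] = 1 if object alpha
# is collected by user i. All degrees are assumed to be nonzero.
A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1],
     [0, 0, 1]]
n_objects, n_users = len(A), len(A[0])
k_obj = [sum(row) for row in A]
k_user = [sum(A[a][i] for a in range(n_objects)) for i in range(n_users)]


def hybrid_matrix(lam):
    """W[a][b] = sum_j A[a][j] * A[b][j] / k_j, divided by k_a**(1-lam) * k_b**lam."""
    W = []
    for a in range(n_objects):
        row = []
        for b in range(n_objects):
            overlap = sum(A[a][j] * A[b][j] / k_user[j] for j in range(n_users))
            row.append(overlap / (k_obj[a] ** (1 - lam) * k_obj[b] ** lam))
        W.append(row)
    return W


def hybrid_scores(lam, f):
    """Redistribute the initial resource vector f in a single pass."""
    W = hybrid_matrix(lam)
    return [sum(W[a][b] * f[b] for b in range(n_objects))
            for a in range(n_objects)]


f = [A[b][0] for b in range(n_objects)]   # target user 0: [1, 1, 0, 0]
heats_like = hybrid_scores(0.0, f)        # pure HeatS
probs_like = hybrid_scores(1.0, f)        # pure ProbS
mixed = hybrid_scores(0.5, f)             # intermediate hybrid
```

For the single-link object 3 the intermediate score falls between the two pure cases, illustrating how the parameter interpolates the treatment of low-degree objects.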



Datasets. Three different datasets (Table 1) were used to
test the above algorithms, differing both in subject mat- 
ter (movies, music and internet bookmarks) and in quanti- 
tative aspects such as user/object ratios and link sparsity. 
The first (Netflix) is a randomly-selected subset of the huge 
dataset provided for the Netflix Prize [22], while the other two
(RYM and Delicious) were obtained by downloading publicly-available
data from the music ratings website RateYourMusic.com and the
social bookmarking website Delicious.com
(taking care to anonymize user identity in the process). 

While the Delicious data is inherently unary (a user has 
either collected a web link or not), the raw Netflix and RYM 
data contain explicit ratings on a 5-star scale. A coarse- 
graining procedure was therefore used to transform these into 
unary form: an object is considered to be collected by a user 
only if the given rating is 3 or more. Sparseness of the datasets 
(defined as the number of links divided by the total number of 
possible user-object pairs) is measured relative to these coarse- 
grained connections. 
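
The coarse-graining step and the sparsity measure can be sketched as follows. The ratings triples and variable names are hypothetical; only the threshold of 3 stars and the definition of sparsity come from the text.

```python
# Hypothetical explicit ratings on a 5-star scale: (user, object, stars).
ratings = [(0, 0, 5), (0, 1, 2), (1, 0, 3), (1, 2, 4), (2, 1, 1), (2, 2, 3)]
n_users, n_objects = 3, 3

THRESHOLD = 3  # a rating of 3 or more counts as "collected"

# Unary links: keep only the user-object pairs whose rating meets the threshold.
links = {(user, obj) for user, obj, stars in ratings if stars >= THRESHOLD}

# Sparsity: number of links divided by the number of possible user-object pairs.
sparsity = len(links) / (n_users * n_objects)
```
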

Recommendation performance metrics. To test a recommendation
method on a dataset we remove at random 10% of the
links and apply the algorithm to the remainder to produce a 
recommendation list for each user. We then employ four dif- 
ferent metrics, two to measure accuracy in recovery of deleted 
links (A) and two to measure recommendation diversity (D): 

(A1) Recovery of deleted links, $r$. An accurate method will
clearly rank preferable objects more highly than disliked ones.
Assuming that users' collected objects are indeed preferred,
deleted links should be ranked higher on average than the
other uncollected objects. So, if uncollected object $\alpha$ is listed
in place $p$ for user $i$, the relative rank $r_{\alpha i} = p/(o - k_i)$ should
be smaller if $\alpha$ is a deleted link (where objects from places $p_1$
to $p_2$ have the same score, which happens often in practice,
we give them all the same relative ranking, $\frac{1}{2}[p_1 + p_2]/[o - k_i]$).
Averaging over all deleted links we obtain a quantity, $r$, such
that the smaller its value, the higher the method's ability to
recover deleted links.
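
The relative rank with shared places for tied scores can be sketched as below, for a single hypothetical user. The function name, object labels and scores are illustrative assumptions; the full metric averages over the deleted links of all users rather than those of one user.

```python
def relative_ranks(scores, n_collected, n_objects):
    """Map each uncollected object to its relative rank p / (o - k_i).

    `scores` holds the scores of all uncollected objects of one user; objects
    with equal scores share the mean of the places they occupy.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    length = n_objects - n_collected          # o - k_i
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        while (end + 1 < len(ordered)
               and scores[ordered[end + 1]] == scores[ordered[start]]):
            end += 1
        p1, p2 = start + 1, end + 1           # 1-based places of the tied block
        for obj in ordered[start:end + 1]:
            ranks[obj] = 0.5 * (p1 + p2) / length
        start = end + 1
    return ranks


# Hypothetical user with k_i = 2 collected objects out of o = 6.
scores = {"c": 0.9, "d": 0.4, "e": 0.4, "f": 0.0}
ranks = relative_ranks(scores, n_collected=2, n_objects=6)
deleted = ["c", "e"]                          # hypothetical probe links
r_user = sum(ranks[obj] for obj in deleted) / len(deleted)
```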

(A2) Precision and recall enhancement, $e_P(L)$ and $e_R(L)$.
Since real users usually consider only the top part of the
recommendation list, a more practical measure may be to consider
$d_i(L)$, the number of user $i$'s deleted links contained in
the top $L$ places. Depending on our concerns, we may be
interested either in how many of these top $L$ places are occupied
by deleted links, or how many of the user's $D_i$ deleted
links have been recovered in this way. Averaging these ratios
$d_i(L)/L$ and $d_i(L)/D_i$ over all users with at least one deleted
link, we obtain the mean precision and recall, $P(L)$ and $R(L)$,
of the recommendation process [21] [29].

A still better perspective may be given by considering
these values relative to the precision and recall of random
recommendations, $P_{rand}(L)$ and $R_{rand}(L)$. If user $i$ has a total
of $D_i$ deleted links, then $P^i_{rand}(L) = D_i/(o - k_i) \approx D_i/o$
(since in general $o \gg k_i$) and hence averaging over all users,
$P_{rand}(L) = D/(ou)$, where $D$ is the total number of deleted
links. By contrast the mean number of deleted links in the
top $L$ places is given by $L D_i/(o - k_i) \approx L D_i/o$ and so
$R_{rand}(L) = L/o$. From this we can define the precision and
recall enhancement,

$$e_P(L) = \frac{P(L)}{P_{rand}(L)} = P(L) \cdot \frac{ou}{D}, \qquad (7a)$$

$$e_R(L) = \frac{R(L)}{R_{rand}(L)} = R(L) \cdot \frac{o}{L}. \qquad (7b)$$

Results for recall are given in SI (Figs. S2 and S3), but
are similar in character to those shown here for precision.
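
A sketch of the precision and recall enhancement for a handful of hypothetical users follows. It takes the random baselines to be $D/(ou)$ and $L/o$ as derived above, and averages precision and recall only over users with at least one deleted link; the function name and data are illustrative.

```python
def enhancement(top_lists, deleted, n_objects, L):
    """Return (e_P, e_R) for recommendation lists cut at length L.

    top_lists[i] is user i's ranked list of uncollected objects and
    deleted[i] is the set of that user's removed links. Users without
    deleted links are skipped in the averages.
    """
    n_users = len(top_lists)
    total_deleted = sum(len(d) for d in deleted.values())   # D
    precisions, recalls = [], []
    for i, ranked in top_lists.items():
        if not deleted.get(i):
            continue
        hits = len(set(ranked[:L]) & deleted[i])            # d_i(L)
        precisions.append(hits / L)
        recalls.append(hits / len(deleted[i]))
    P = sum(precisions) / len(precisions)
    R = sum(recalls) / len(recalls)
    e_P = P * n_objects * n_users / total_deleted           # P / (D / (o u))
    e_R = R * n_objects / L                                 # R / (L / o)
    return e_P, e_R


# Hypothetical lists for three users over o = 10 objects.
top_lists = {0: [3, 4, 5], 1: [1, 2, 3], 2: [7, 8, 9]}
deleted = {0: {3, 9}, 1: {2}, 2: set()}
e_P, e_R = enhancement(top_lists, deleted, n_objects=10, L=2)
```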

(D1) Personalization, $h(L)$. Our first measure of diversity
considers the uniqueness of different users' recommendation
lists, that is, inter-user diversity. Given two users $i$ and
$j$, the difference between their recommendation lists can be
measured by the inter-list distance,

$$h_{ij}(L) = 1 - \frac{q_{ij}(L)}{L} \qquad (8)$$

where $q_{ij}(L)$ is the number of common items in the top $L$
places of both lists: identical lists thus have $h_{ij}(L) = 0$
whereas completely different lists have $h_{ij}(L) = 1$. Averaging
$h_{ij}(L)$ over all pairs of users with at least one deleted link
we obtain the mean distance $h(L)$, for which greater or lesser
values mean respectively greater or lesser personalization of
users' recommendation lists.
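
The personalization measure can be sketched as below, assuming the inter-list distance is one minus the fraction of the top $L$ places shared by two lists, which is the simplest form consistent with the limiting values stated above. The function name and lists are hypothetical.

```python
from itertools import combinations


def personalization(top_lists, L):
    """Mean inter-list distance h(L) over all pairs of users."""
    distances = []
    for i, j in combinations(sorted(top_lists), 2):
        common = len(set(top_lists[i][:L]) & set(top_lists[j][:L]))  # q_ij(L)
        distances.append(1 - common / L)
    return sum(distances) / len(distances)


# Hypothetical top-4 lists for three users.
top_lists = {0: [1, 2, 3, 4], 1: [1, 2, 5, 6], 2: [7, 8, 9, 10]}
h = personalization(top_lists, L=4)
```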

(D2) Surprisal/novelty, $I(L)$. The second type of diversity
concerns the capacity of the recommender system to generate
novel and unexpected results, that is, to suggest objects a user is
unlikely to already know about. To measure this we use the
self-information or "surprisal" [30] of recommended objects,
which measures the unexpectedness of an object relative to its
global popularity. Given an object $\alpha$, the chance a randomly-selected
user has collected it is given by $k_\alpha/u$ and thus its
self-information is $I_\alpha = \log_2(u/k_\alpha)$. From this we can
calculate the mean self-information $I_i(L)$ of each user's top $L$
objects, and averaging over all users with at least one deleted
link we obtain the mean top-$L$ surprisal $I(L)$.
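
A short sketch of the surprisal calculation, using hypothetical object degrees and recommendation lists, is given below; the function and object names are ours.

```python
from math import log2


def mean_surprisal(top_lists, object_degree, n_users, L):
    """Average over users of the mean self-information of their top-L objects."""
    per_user = []
    for ranked in top_lists.values():
        top = ranked[:L]
        per_user.append(
            sum(log2(n_users / object_degree[a]) for a in top) / len(top))
    return sum(per_user) / len(per_user)


# Hypothetical degrees in a community of u = 8 users.
object_degree = {"hit": 8, "mid": 4, "niche": 1}
top_lists = {0: ["hit", "mid"], 1: ["mid", "niche"]}
surprisal = mean_surprisal(top_lists, object_degree, n_users=8, L=2)
```

An object collected by every user carries zero self-information, so a list dominated by such objects scores close to zero on this metric.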

Note that unlike the metrics for accuracy, the diversity- 
related measures could be averaged over all users regardless 
of whether they have deleted links or not, but the final results 
do not differ significantly. Where metrics depend on L, differ- 
ent choices result in shifts in the precise numbers but relative 
performance differences between methods remain unchanged 
so long as $L \ll o$. Extended results are available in SI (Figs.
S4 and S5); a value of L = 20 was chosen for the results dis- 
played here in order to reflect the likely length of a practical 
recommendation list. 



Results 

Individual algorithms. A summary of the principal results for 
all algorithms, metrics and datasets is given in Table 2.

ProbS is consistently the strongest performer with respect 
to accuracy, with USim a close second, while both GRank and 
HeatS perform significantly worse (the latter reporting par- 
ticularly bad performance with respect to precision enhance- 
ment). By contrast with respect to the diversity metrics HeatS 
is by far the strongest performer: ProbS has some success with 
respect to personalization, but along with USim and GRank 
performs weakly where surprisal (novelty) is concerned. 

That GRank has any personalization at all (h(L) > 0) 
stems only from the fact that it does not recommend items 
already collected, and different users have collected different 
items. The difference in GRank's performance between Netflix,
RYM and Delicious can be ascribed to the "blockbuster"
phenomenon common in movies, far less so with music and 
web links: the 20 most popular objects in Netflix are each 
collected by on average 31.7% of users, while for RYM the 
figure is 7.2% and for Delicious only 5.6%. 

The opposing performances of ProbS and HeatS — the 
former favoring accuracy, the latter personalization and 
novelty — can be related to their different treatment of popular 
objects. The random-walk procedure of ProbS favors highly- 
connected objects, whereas the averaging process of HeatS 
favors objects with few links: for example, in the Delicious 
dataset the average degree of users' top 20 objects as returned 
by ProbS is 346, while with HeatS it is only 2.2. Obviously the 
latter will result in high surprisal values, and also greater per- 
sonalization, as low-degree objects are more numerous and a 
method that favors them has a better chance of producing dif- 
ferent recommendation lists for different users. On the other 
hand randomly-deleted links are clearly more likely to point 
to popular objects, and methods that favor low-degree objects 
will therefore do worse; hence the indiscriminate but populist 
GRank is able to outperform the novelty-favoring HeatS.

If we deliberately delete only links to low-degree objects, 
the situation is reversed, with HeatS providing better accuracy,
although overall performance of all algorithms deteriorates
(Table 3 and Fig. S6). Hence, while populism can be a
cheap and easy way to get superficially accurate results, it is 
limited in scope: the most appropriate method can be deter- 
mined only in the context of a given task or user need. The 
result also highlights the very distinct and unusual character 
of HeatS compared to other recommendation methods. 

Hybrid methods. Given that different algorithms serve differ- 
ent purposes and needs, is it possible to combine two (or more) 
in such a way as to obtain the best features of both? With 
HeatS favoring diversity and ProbS accuracy, their hybrid
combination (Eq. 6) might be expected to provide a smooth
transition from one to the other. In fact, the situation is even
more favorable: while pure HeatS represents the optimum for
novelty, it is possible to obtain performance improvements
relative to all other metrics by tuning the hybridization
parameter $\lambda$ appropriately (Fig. 2). The accuracy of ProbS can
thus be maintained and even improved while simultaneously 
attaining diversity close to or even exceeding that of HeatS. 
Alternatively, diversity can be favored while minimizing the 
cost in terms of accuracy. 

Depending on the particular needs of a system and
its users, one can define an arbitrary utility function
$U(r, e_P, h, I, L)$ and choose $\lambda$ to optimize it: Table 4 gives as
an example the percentage improvements that can be made,
relative to pure ProbS ($\lambda = 1$), if we choose $\lambda$ to minimize
$r$. Shared improvements are obtained for all metrics except
with the Delicious dataset, where minimizing $r$ has a negative
effect on $e_P(L)$. However, from Fig. 2 we can see that even in
this case it is possible to choose a value of $\lambda$ to simultaneously
improve all metrics relative to ProbS.

Although HeatS+ProbS provides the best performance
when taking into account all the metrics, other hybrids
(constructed using the more general method of Eq. 5) can
provide some valuable individual contributions (Fig. S7).
HeatS+USim behaves similarly to HeatS+ProbS, but with
generally smaller performance improvements. A more interesting
hybrid is to combine the poorly-performing GRank with
either HeatS or ProbS. These combinations can have a dramatic
effect on link recovery: for RYM either can be tuned to
produce an improvement in $r$ of almost 30% (relative to pure
ProbS), compared to only 6.8% for the HeatS+ProbS hybrid
(Table 4).

The explanation for these improvements stems from the 
way in which ProbS and HeatS interact with sparse datasets. 
Coverage of uncollected objects is limited to those sharing a 
user in common with an object collected by the target user 
(Fig. 1): all others receive a score of zero and so share a common
(and large) relative rank, $r_{\alpha i} = (o - \frac{1}{2}(Z - 1))/(o - k_i)$,
where $Z$ is the number of objects with zero score. GRank,
with its universal coverage, is able to differentially rank these 
objects and so lower their contributions to r. Consequently, 
while incorporating it too strongly has a deleterious effect on 
the other metrics, a small GRank contribution can provide a 
useful enhancement to recommendation coverage — notably in 
"cold start" cases where little or nothing is known about a 
user. 

Discussion 

Recommender systems have at their heart some very sim- 
ple and natural social processes. Each one of us looks to 
others for advice and opinions, learning over time who to 
trust and whose suggestions to discount. The paradox is that 
many of the most valuable contributions come not from close 
friends but from people with whom we have only a limited 
connection — "weak ties" who alert us to possibilities outside 
our regular experience [31].



The technical challenges facing recommender systems in- 
volve similar paradoxes. The most reliably accurate algo- 
rithms are those based on similarity and popularity of users 
and objects, yet the most valuable recommendations are those 
of niche items users are unlikely to find for themselves [21].
In this paper we have shown how this apparent dilemma can 
be resolved by an appropriate combination of diversity- and 
accuracy-focused methods, using a hybrid algorithm that joins 
a method with proven high accuracy with a new algorithm 
dedicated specifically to the production of novel and person- 
alized recommendations. Their combination allows not merely 
a compromise between the two imperatives but allows us to 
simultaneously increase both accuracy and diversity of rec- 
ommendations. By tuning the degree of hybridization the 
algorithms can be tailored to many custom situations and re- 
quirements. 

We expect these results to be general: while we have pre- 
sented a particular set of algorithms and datasets here, other 
recommender systems must face the same apparent dilemma 
and we expect them to benefit from a similar hybrid approach. 
It is interesting to note that while the Netflix Prize focused
solely on accuracy, the winning entry in fact took a diversification
approach, in this case based on tracking the changes in
user opinions over time [32].

The algorithms presented here rely on no more than unary 
data and can thus place diversity at the heart of the recom- 
mendation process while still being applicable to virtually any 
dataset. More detailed sources of information can neverthe- 
less be used to extend the recommendation process. Topical 
information and other measures of item-item similarity can 
be used to further diversify recommendation lists [24]: user-generated
classifications such as tags [33] [34] [35] may be useful
here. The HeatS and ProbS algorithms, and hence their
hybrid, can be further customized by modifying the initial 
allocation of resource [36] to increase or decrease the influ- 
ence of selected objects on the recommendation process. The 
hybridization process itself can be extended by incorporating 
techniques such as content-based or semantic analyses [23] . 

The ultimate measure of success for any recommender sys- 
tem is of course in the appreciation of its users, and in partic- 
ular the ability of the system to serve their often very distinct 
needs. While in this paper we have optimized the hybrid 
from a global perspective, there is no reason why it cannot be 
tuned differently for each individual user — either by the sys- 
tem provider or by users themselves. This last consideration 
opens the door to extensive future theoretical and empirical 
research, bringing diversity and personalization not just to the 
contents of recommendation lists, but to the recommendation 
process itself. 

Fig. 1. The HeatS (a,b,c) and ProbS (d,e,f) algorithms (Eqs. 1 and 2) at work on the
bipartite user-object network. Objects are shown as squares, users as circles, with the target 
user indicated by the shaded circle. While the HeatS algorithm redistributes resource via a 
nearest-neighbour averaging process, the ProbS algorithm works by an equal distribution of 
resource among nearest neighbours. 

Table 1. Properties of the tested datasets. 

dataset      users    objects      links   sparsity

Netflix     10 000      6 000    701 947   1.17 · 10^-2
RYM         33 786      5 381    613 387   3.37 · 10^-3
Delicious   10 000    232 657  1 233 997   5.30 · 10^-4

Table 2. Performance of the recommendation algorithms according to each of the four metrics: recovery of deleted links, 
precision enhancement, personalization, and surprisal. 

Table 3. Performance of individual recommendation algorithms for a probe set
consisting of only low-degree (k < 100) objects.

method   r       e_P(20)   h(20)   I(20)

GRank    0.327   0.000     0.525   1.68
USim     ...     0.000     0.579   1.72
ProbS    0.279   ...       ...     1.74
HeatS    0.262   ...       ...     ...

Table 4. Tuning the HeatS+ProbS hybridization parameter $\lambda$ to optimize
for $r$ produces simultaneous improvements in other metrics. The relative
changes are given in percentage terms against the pure ProbS algorithm.

dataset     $\lambda$   $\delta r$   $\delta e_P(20)$   $\delta h(20)$   $\delta I(20)$

Netflix     0.23        10.6%        16.3%              28.3%            28.8%
RYM         0.41         6.8%        10.8%              20.1%            17.2%
Delicious   0.66         1.2%        -6.0%              22.5%            61.7%

Fig. 2. Performance of the HeatS+ProbS hybrid algorithm (Eq. 6) on the three different datasets. By varying the hybridization parameter between pure HeatS ($\lambda = 0$)
and pure ProbS ($\lambda = 1$) it is possible to gain simultaneous performance enhancements with respect to both accuracy ($r$ and $e_P(L)$) and diversity ($h(L)$ and $I(L)$) of
recommendations. Tuning $\lambda$ in this fashion allows the algorithm to be customized and optimized for different user or community needs.

Supporting information

Zhou et al., "Solving the apparent diversity-accuracy dilemma 
of recommender systems" 

Figure S1. Elegant hybrids of the HeatS and ProbS algorithms can be created in several ways besides that given in Eq. 6 of
the paper: for example $W'_{\alpha\beta} = \left(\frac{1-\lambda}{k_\alpha} + \frac{\lambda}{k_\beta}\right) \sum_{j=1}^{u} a_{\alpha j} a_{\beta j}/k_j$, or $W''_{\alpha\beta} = \frac{1}{(1-\lambda) k_\alpha + \lambda k_\beta} \sum_{j=1}^{u} a_{\alpha j} a_{\beta j}/k_j$. While $W'_{\alpha\beta}$ performs well
only with respect to $I(20)$, Eq. 6 and $W''_{\alpha\beta}$ both have their advantages. However, Eq. 6 is somewhat easier to tune to different
requirements since it varies more slowly and smoothly with $\lambda$. The results shown here are for the RateYourMusic dataset.

Figure S2. Precision P(L) and recall R(L) provide complementary but contrasting measures of accuracy: the former considers
what proportion of selected objects (in our case, objects in the top L places of the recommendation list) are relevant, the latter 
measures what proportion of relevant objects (deleted links) are selected. Consequently, recall (red) grows with L, whereas 
precision (blue) decreases. Here we compare precision and recall for the HeatS+ProbS hybrid algorithm on the Delicious and 
Netflix datasets. While quantitatively different, the qualitative performance is very similar for both measures. 

Figure S3. A more elegant comparison can be obtained by considering precision and recall enhancement, that is, their values
relative to that of randomly-sorted recommendations: $e_P(L) = P(L) \cdot ou/D$ and $e_R(L) = R(L) \cdot o/L$ (Eqs. 7a, b in the paper).
Again, qualitative performance is close, and both of these measures decrease with increasing L, reflecting the inherent difficulty
of improving either measure given a long recommendation list.

Figure S4. Comparison of the diversity-related metrics $h(20)$ and $I(20)$ when two different averaging procedures are used:
averaging only over users with at least one deleted link (as displayed in the paper) and averaging over all users. The different
procedures do not alter the results qualitatively and make little quantitative difference. The results shown are for the
RateYourMusic dataset.






Figure S5. Comparison of performance metrics for different lengths L of recommendation lists: L = 10 (red), L = 20 (green) 
and L = 50 (blue). Strong quantitative differences are observed for precision enhancement ep(L) and personalization h(L), 
but their qualitative behaviour remains unchanged. Much smaller differences are observed for surprisal I(L). 

Figure S6. Our accuracy-based metrics all measure in one way or another the recovery of links deleted from the dataset. Purely
random deletion will inevitably favor high-degree (popular) objects, with their greater proportion of links, and consequently 
methods that favor popular items will appear to provide higher accuracy. To study this effect, we created two special probe sets 
consisting of links only to objects whose degree was less than some threshold (either 100 or 200): links to these objects were 
deleted with probability 0.5, while links to higher-degree objects were left untouched. The result is a general decrease in accuracy 
for all algorithms — unsurprisingly, since rarer links are inherently harder to recover — but also a reversal of performance, with 
the low-degree-favoring HeatS now providing much higher accuracy than the high-degree-oriented ProbS, USim and GRank. 
The results shown here are for the Netflix dataset. 




Figure S7. In addition to HeatS+ProbS, various other hybrids were created and tested using the method of Eq. 5 in the paper, 
where for hybrid X+Y, $\lambda = 0$ corresponds to pure X and $\lambda = 1$ pure Y. The results shown here are for the Netflix dataset.
The HeatS+USim hybrid offers similar but weaker performance compared to HeatS+ProbS; combinations of GRank with other 
methods produce significant improvements in r, the recovery of deleted links, but show little or no improvement of precision 
enhancement ep(L) and poor results in diversity-related metrics. We can conclude that the proposed HeatS+ProbS hybrid is 
not only computationally convenient but also performs better than combinations of the other methods studied. 